Document classification with weighted supervised n-gram embedding

ABSTRACT

Methods and systems for document classification include embedding n-grams from an input text in a latent space, embedding the input text in the latent space based on the embedded n-grams and weighting said n-grams according to spatial evidence of the respective n-grams in the input text, classifying the document along one or more axes, and adjusting weights used to weight the n-grams based on the output of the classifying step.

RELATED APPLICATION INFORMATION

This application claims priority to provisional application Ser. No. 61/492,228, filed on Jun. 1, 2011, incorporated herein by reference. The application further claims priority to provisional application Ser. No. 61/647,012, filed on May 15, 2012, incorporated herein by reference.

BACKGROUND

1. Technical Field

The present invention relates to document classification and, more particularly, to document classification using supervised weighted n-gram embedding.

2. Description of the Related Art

The task of document classification is defined as the automatic assignment of one or more categorical labels to a given document. Examples of document classification include topic categorization, sentiment analysis, and formality studies. A document may include a sentence, paragraph, or any snippet of text; the term is defined herein to encompass all such objects.

Previous techniques applied to this task are either generative or discriminative supervised methods. Discriminative document classification techniques commonly rely on the so-called "bag-of-words" (BoW) representation, which maps text articles of variable lengths into a fixed-dimensional vector space parameterized by a finite vocabulary. The BoW model treats a document as an unordered collection of word-features and utilizes the distribution of the words as the primary evidence for its classification. The "bag-of-unigrams" is the most common form of BoW representation and utilizes a word dictionary as its vocabulary.

Some classification attempts have employed short phrases as being more effective than single words (unigrams) for the task. These extend the "bag-of-unigrams" model by incorporating n-grams (contiguous sequences of n words) into the vector space representation of the text. However, the complexity of modeling n-grams grows exponentially with the dictionary size. For an English word dictionary of size |D|, bi-gram and trigram representations of the text entail |D|² and |D|³ free parameters, respectively.

Despite the simplicity and relative success of document classification using n-gram features, previous models disregard all the spatial and ordering information of the n-grams, and such information is important for text data. In the example of sentiment analysis, phrases with strong polarity might be more important in deciding the polarity of the whole document. For example, a document with the phrase "generally good" at its beginning is very likely to be a positive sentiment. When the same phrase appears in the middle of another text, the document is less likely to be a positive comment. Similarly, the start and the end of sentences or paragraphs in an on-line news article might contain more critical and subjective information than its other parts. Completely capturing such relationships would require full semantic understanding, which is beyond the current state of technology.

SUMMARY

A method for document classification includes embedding n-grams from an input text in a latent space; embedding the input text in the latent space based on the embedded n-grams and weighting said n-grams according to spatial evidence of the respective n-grams in the input text; classifying the document along one or more axes using a processor; and adjusting weights used to weight the n-grams based on the output of the classifying step.

A system for document classification includes an n-gram embedding module configured to embed n-grams from an input text in a latent space; a document embedding module configured to embed the input text in the latent space based on the embedded n-grams, weighted according to spatial evidence of the respective n-grams in the input text; a classifier configured to classify the document along one or more axes using a processor; and a weight learning module configured to adjust the weights for the n-grams based on the output of the classifier.

These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF DRAWINGS

The disclosure will provide details in the following description of preferred embodiments with reference to the following figures, wherein:

FIG. 1 is a block/flow diagram of a method for document classification according to the present principles.

FIG. 2 is a block/flow diagram of a method for embedding n-grams in a latent space.

FIG. 3 is a block/flow diagram of a method for weighting n-grams to model a document according to the present principles.

FIG. 4 is a block/flow diagram of a training procedure that produces weights for document embedding according to the present principles.

FIG. 5 is a block/flow diagram of an alternative embodiment of weighting n-grams according to the present principles.

FIG. 6 is a diagram of a system for document classification according to the present principles.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The present principles provide a unified deep learning framework for using high-order n-grams and for exploring spatial information. This goes beyond the bag-of-words (BoW) strategy for document classification. First, a supervised embedding mechanism is provided to directly model n-grams in a low-dimensional latent semantic space. Then, to achieve a fine-grained model of target documents, a controller module is trained by spatially re-weighting the sub-parts of each document. The document representation is then provided to a classifier that is trained for target classification tasks. A deep neural network learns the parameters of the latent space, the article modeling layers, and the classifier jointly in one end-to-end discriminative framework.

Compared to a BoW strategy using feature selection or an n-gram embedding based method, an advantage of the present principles is that the whole-document modeling using the controller module provides a spatial weighting scheme for different subparts using the supervised classification signals. All layers are optimized for the target task and need little human intervention, thereby making the system fully adaptable to other tasks.

Before addressing the particulars of the present principles, the following notations are used throughout.

D denotes an underlying word (unigram) dictionary and S denotes the set of all finite length sequences of words from D. The operator |·| denotes the cardinality of a set. An input text sequence of length N will be denoted as x=(w₁, . . . , w_(N)), where w_(j)εD, xεS, and j indicates the j-th position in x. Formally, the basic BoW representation applies a mapping φ(·) to a text string x and converts it into a feature space of fixed dimension (e.g., |D|). It is in this space that a standard classifier such as a linear perceptron or support vector machine can be applied. The mapping φ: S→R^(M) takes word sequences in S and maps them to a finite-dimensional feature space. The document labels form a set Y={1, . . . , C}. For example, C=2 denotes sentiment classes such as "positive" or "negative." A labeled training set with training labels from Y is denoted as X={(x_(i),y_(i))_(i=1, . . . , L)|x_(i)εS, y_(i)εY}.

Referring now to FIG. 1, a diagram showing a method for document classification is provided. Block 102 inputs the raw text of the document. As noted above, this document may be a text sequence of any length, broken into a set of unigrams separated by some appropriate token (e.g., a space or punctuation). Block 104 uses a sliding window to divide the text into n-grams, creating one n-gram for each set of n consecutive words in the document. Each n-gram has a corresponding feature vector that characterizes the n-gram.
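As an illustration only, the following minimal Python sketch shows the sliding-window step of block 104, assuming whitespace tokenization; the function name and tokenizer are illustrative choices rather than details fixed by the present principles.

    # Illustrative sketch of block 104: slide a window of length n over the
    # unigram tokens of block 102, one n-gram per set of n consecutive words.
    # Whitespace tokenization is an assumption; any appropriate token works.
    def extract_ngrams(text, n=3):
        words = text.split()
        return [tuple(words[j:j + n]) for j in range(len(words) - n + 1)]

    print(extract_ngrams("generally good value for the price", n=2))
    # [('generally', 'good'), ('good', 'value'), ('value', 'for'),
    #  ('for', 'the'), ('the', 'price')]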

Block 106 uses factors to weight the n-gram vectors according to where each n-gram is located within the document. For example, n-grams that are located near the beginning or end of the document may have greater weight than those n-grams located in the middle. Block 108 uses the weighted vectors to obtain a fixed-dimension representation of the document, a process called "embedding." The resulting vector representation of the text document is used in block 110 to classify the document according to a predetermined classification system. For example, a classifier may discriminate between "positive" and "negative" statements, or may instead use any well-defined classification system. Block 112 uses the results of the classification to "backpropagate" and train the weights and document embeddings used in blocks 106 and 108.

Embodiments described herein may be entirely hardware, entirely software, or include both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.

Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer-readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be a magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, an optical disk, etc.

A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.

The present principles are exemplified using common text classification tasks: sentimental text binary classification and news text categorization. Sentimental text binary classification predicts a binary sentiment, positive or negative, expressed by a given document, whereas news text categorization predicts the semantic class to which a news text piece belongs.

To overcome the dimensionality burdens that high-order models impose, the present principles project n-grams to a low-dimensional latent semantic space at block 104 using only unigram word dictionaries. Associated parameters are estimated with a bias for target classification. This projection of n-grams can be performed with a word embedding or with direct phrase embedding.

Γ denotes the vocabulary of n-grams in an input corpus, with each n-gram γ_(j)=(w_(j), w_(j+1), . . . , w_(j+n−1)), where j indicates the j-th position in the input document x. In a bag-of-unigrams representation, the function φ(x) would map the input x in a natural way as a (sparse) vector of dimensionality M=|D|. Similarly, in a bag-of-n-grams representation, the function φ(x) maps x to an M=|Γ|-dimensional representation, with |Γ|=O(|D|^(n)). Using a sparse vector representation, a unigram word w_(j) can be described as a vector

$e_{w_j} = (0, \ldots, 0, \underbrace{1}_{\text{at index } w_j}, 0, \ldots, 0)^T,$

with an n-gram vector e_(γ_(j))=[e_(w_(j))^(T), e_(w_(j+1))^(T), . . . , e_(w_(j+n−1))^(T)]^(T) being an n|D|-dimensional sparse vector that concatenates the vectors of the words in the n-gram according to their linear ordering.
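For concreteness, the sparse representation can be sketched as follows; the toy dictionary and function names are hypothetical and serve only to illustrate the one-hot encoding and its concatenation.

    import numpy as np

    # Toy three-word dictionary (hypothetical). e_w is the |D|-dimensional
    # one-hot vector of a word; an n-gram is the n|D|-dimensional
    # concatenation of its word vectors in their linear order.
    D = {"generally": 0, "good": 1, "bad": 2}

    def one_hot(word):
        e = np.zeros(len(D))
        e[D[word]] = 1.0
        return e

    def ngram_one_hot(ngram):
        return np.concatenate([one_hot(w) for w in ngram])

    print(ngram_one_hot(("generally", "good")))  # n*|D| = 6 entries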

Referring now to FIG. 2, a method for embedding n-grams with word embedding is shown. In word embedding, because individual words carry significant semantic information, each word is projected into a real-valued vector space in block 202. Specifically, each word w_(j)εD is embedded into an m-dimensional feature space using a lookup table LT_(E)(·) defined as

$LT_E(w_j) = E \times e_{w_j} = E \times (0, \ldots, 0, \underbrace{1}_{\text{at index } w_j}, 0, \ldots, 0)^T = E_{w_j}, \qquad (1)$

where EεR^(m×|D|) is a matrix with word embedding parameters to be learned. Here, E_(w_(j))εR^(m) is the embedding of the word w_(j) in the dictionary D, and m denotes the target word embedding dimensionality. It is important to note that the parameters of E are automatically trained during the learning process using backpropagation, discussed below. An M-dimensional representation of each n-gram is then computed from the n individual m-dimensional embedding vectors that make up the phrase.

This formation of n-grams is carried through a sliding window of length n. Given an n-gram γ_(j) of n adjacent words at position j, in block 204 the word lookup table layer applies the same operation for each word inside the n-gram, producing z_(γ_(j))=[E_(w_(j))^(T), E_(w_(j+1))^(T), . . . , E_(w_(j+n−1))^(T)]^(T) as an operator that concatenates the embeddings of the words in the n-gram γ_(j), resulting in an nm-dimensional vector z_(γ_(j)). The embedding of γ_(j) can then be defined as

$p_{\gamma_j} = h(F \times z_{\gamma_j}) = h(F \times [E_{w_j}^T, E_{w_{j+1}}^T, \ldots, E_{w_{j+n-1}}^T]^T), \qquad (2)$

where the projection matrix FεR^(M×nm) maps the vector z into an M-dimensional latent space and h(·)=tanh(·). The h(·) function is not limited to the hyperbolic tangent, but may instead be any appropriate function that converts an unbounded range into a range from −1 to 1. In this manner, word-based embedding constructs a low-dimensional latent embedding for all phrases γ_(j)εx by first projecting each word into a latent space in block 202, followed by a second projection to obtain the latent embedding of each n-gram in block 204.
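A minimal numerical sketch of equations (1) and (2) follows, assuming small illustrative dimensions; the matrices here are randomly initialized stand-ins for the parameters E and F that would be learned by backpropagation.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, m, n, M = 1000, 50, 3, 100           # illustrative sizes only

    E = rng.normal(scale=0.1, size=(m, vocab))  # word embeddings, Eq. (1)
    F = rng.normal(scale=0.1, size=(M, n * m))  # n-gram projection, Eq. (2)

    def embed_ngram(word_indices):
        # Eq. (1): the lookup LT_E(w_j) = E x e_{w_j} reduces to selecting
        # the column of E at index w_j.
        z = np.concatenate([E[:, w] for w in word_indices])  # nm-dim z
        # Eq. (2): project z into the M-dimensional latent space via F,
        # squashed by the transfer function h = tanh.
        return np.tanh(F @ z)                                # p_{gamma_j}

    print(embed_ngram([5, 42, 7]).shape)  # (100,)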

With the goal of classifying the whole document into certain classes, the whole document may be represented based on its included n-grams. The structured spatial patterns in natural language text cannot be captured by the unstructured "bag" assumption of BoW.

The n-gram embedding computes a low-dimensional representation of all n-grams in a given text. Some function φ(·) is used to compress the information in a text document of variable length to a finite-dimensional feature space (a document embedding vector). The class prediction for an input text is then computed using a function g(·), defined in the resulting low-dimensional space. While there are many possibilities to combine latent n-grams into a document embedding vector, an averaging strategy is described herein. Formally, the document representation is defined as:

$\begin{matrix}{{{\varphi (x)} \equiv d_{x\;}} = {\frac{1}{N}{\sum\limits_{j = 1}^{N}p_{\gamma_{j}}}}} & (3)\end{matrix}$

where d_(x)εR^(M) and x=(w₁, . . . , w_(N)). In other words, d_(x) is the centroid of the vectors associated with the n-grams of the document x. Using sentiment classification as a test case, the sentiment polarity of a document is intuitively related to the aggregated semantics or polarity of all its n-grams. In other words, the more positive n-grams present in the document, the more likely it is for the document to express a positive opinion. While there are many possibilities for the aggregation function, a mean value function provides a good summarization of the document's sentiment in this latent space. One can also use a maximizing function that selects the maximum value along each latent dimension.
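The following short sketch shows the averaging of equation (3) together with the max-pooling alternative mentioned above; the stacked array of n-gram embeddings is a random placeholder for the p_(γ_(j)) vectors computed earlier.

    import numpy as np

    # p_grams stacks the latent n-gram embeddings of one document
    # (random placeholders here; in practice they come from Eq. (2)).
    p_grams = np.random.default_rng(1).normal(size=(8, 100))  # 8 n-grams, M=100

    d_mean = p_grams.mean(axis=0)  # Eq. (3): centroid of the n-gram vectors
    d_max = p_grams.max(axis=0)    # alternative: max along each latent dimension

    print(d_mean.shape, d_max.shape)  # (100,) (100,)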

Referring to FIG. 3, a method for reweighting n-grams to model a document is shown. The function φ(·) is implemented herein as a weighted sum. The weights for each n-gram γ_(j)εx are learned based on the n-gram's position in the text, as determined in block 302. The weights are used to combine the latent embeddings of γ_(j) into a document-level representation. Specifically, a latent embedding of document x in the latent space is defined as:

$\varphi(x) \equiv d_x = \sum_{j=1}^{N} q_j \times p_{\gamma_j}, \qquad (4)$

where d_(x)εR^(M) and x=(w₁, . . . , w_(N)). The convex combination parameter for the phrase γ_(j) is initially defined as q_(j)=1/N, and is subsequently learned based on the location of the phrase γ_(j) in block 304. The weight of every γ_(j) is modeled as a scalar q_(j) using the following mixture model. Let γ_(j)εx, |x|=N, and jε{1 . . . N} indicate the position of an n-gram in x, and define the weight associated with γ_(j) as:

$q_j = \frac{1}{Q} \sum_{k=1}^{K} \mathrm{sigmoid}\left(a_k \cdot \frac{j}{N} + b_k\right), \qquad (5)$

where a_(k), b_(k) are parameters to be learned, Q=Σ_(j=1)^(N) q_(j), K specifies the number of mixture quantities, and sigmoid(·) is a non-linear transfer function. The spatial re-weighting attempts to capture longer "trends" within each document and preserves spatial information for phrases within a document. The parameters in equation 5 include two vectors, a and b. They are learned/identified by a "backpropagation" training process. The n-gram vectors p_(γ_(j)) are each multiplied by their respective combination parameters q_(j) in block 306, and combined in block 308 to form the document vector d_(x).
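A brief sketch of the position-based weighting of equation (5) feeding the weighted sum of equation (4); Q is computed here as the sum of the unnormalized mixture responses so that the weights sum to one, and the parameter values are random placeholders for what training would learn.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def spatial_weights(N, a, b):
        # Eq. (5): each q_j is a K-component sigmoid mixture of the n-gram's
        # relative position j/N, normalized by Q so the weights sum to one.
        j = np.arange(1, N + 1) / N
        raw = sigmoid(np.outer(j, a) + b).sum(axis=1)  # sum over K mixtures
        return raw / raw.sum()                         # divide by Q

    rng = np.random.default_rng(2)
    a, b = rng.normal(size=4), rng.normal(size=4)      # K=4; learned in practice
    p_grams = rng.normal(size=(8, 100))                # latent n-gram embeddings
    q = spatial_weights(len(p_grams), a, b)
    d_x = q @ p_grams                                  # Eq. (4): weighted sum
    print(round(q.sum(), 6), d_x.shape)                # 1.0 (100,)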

An alternative embodiment for modeling the weight of every γ_(j) is to use both the position evidence and the content of the n-gram. Let γ_(j)εx, |x|=N, and jε{1 . . . N} indicate the position of an n-gram in x. Another linear projection layer is utilized as

$q_j = h\left(B \times \left[\frac{j}{N_x}, p_{\gamma_j}\right]\right), \qquad (6)$

where the projection matrix BεR^((M+1)×1) maps the vector [j/N_x, p_(γ_(j))] (concatenating the relative position and the n-gram representation) into a scalar, and h(·)=tanh(·). The resulting weight value q_(j) for phrase γ_(j) considers not only the spatial evidence, but also the n-gram itself. The parameters in equation 6 include the projection matrix B, which is identified (learned) by the "backpropagation" training process.
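A compact sketch of equation (6), under the assumption that B can be held as a single vector of length M+1; the values shown are illustrative placeholders for learned parameters.

    import numpy as np

    def content_position_weight(j, N, p_gram, B):
        # Eq. (6): concatenate the relative position j/N with the n-gram's
        # latent embedding and project to a scalar weight through h = tanh.
        v = np.concatenate(([j / N], p_gram))  # (M+1)-dimensional input
        return np.tanh(B @ v)                  # scalar q_j

    rng = np.random.default_rng(3)
    B = rng.normal(scale=0.1, size=101)        # M+1 = 101; learned in practice
    print(content_position_weight(3, 8, rng.normal(size=100), B))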

Referring now to FIG. 5, an alternative embodiment for modeling the spatial evidence in a document is shown. This method models longer sequence patterns. Considering the fact that text documents have variable length, each document is uniformly split into K equal sub-parts according to an original linear ordering in block 502. In the present example, K=4, but it is contemplated that any appropriate division may be used. Given a document x=(w₁, . . . , w_(N)) with N words, where w_(j) describes the word at position j, the split produces:

$x = (w_1, \ldots, w_N) = [x_1, x_2, \ldots, x_k, \ldots, x_K],$

where the notation [·] denotes the concatenation of subsequences into an N-dimensional vector. For kε{1, . . . , K}, it can be shown that

$x_k = \left(w_{(k-1)\lfloor N/K \rfloor + 1}, \ldots, w_{k \lfloor N/K \rfloor}\right).$

As with FIG. 3, for each sub-part, block 504 determines relative positions for the n-grams in the document, block 506 weights the n-grams using a combination parameter, block 508 multiplies the n-gram vectors by their respective combination parameters, and block 510 adds the weighted vectors in each sub-part to form sub-part vectors. As above, equation 6 may be used to calculate the sub-part embedding representations p_(k), with the representation of the whole document being defined as

$d_x = h\left([p_1^T, p_2^T, \ldots, p_K^T]^T\right) \qquad (7)$

which is a vector having size KM. The document vector is built from the concatenation of the embedding vectors from its K subsequences in block 512. This captures even "long range" spatial patterns, where each subsequence is normally much longer than a typical n-gram. The concatenation operation also keeps the original linear ordering between subsequences, which is useful for the classification of a document.
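The sub-part scheme of FIG. 5 and equation (7) can be sketched as follows; mean aggregation within each sub-part stands in here for the per-sub-part weighting described above, as an illustrative simplification.

    import numpy as np

    def embed_document_subparts(p_grams, K=4):
        # Block 502: split the n-gram embeddings into K equal sub-parts in
        # their original order; summarize each sub-part (mean used here as a
        # simplification) and concatenate into a KM-dim vector, Eq. (7).
        parts = np.array_split(p_grams, K)
        return np.tanh(np.concatenate([part.mean(axis=0) for part in parts]))

    p_grams = np.random.default_rng(4).normal(size=(20, 100))
    print(embed_document_subparts(p_grams, K=4).shape)  # (400,) i.e., K*M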

Document classification in block 110 may be performed using a linear projection function. Given the document embedding d_(x) and C candidate classes, W represents a linear projection matrix which projects the embedding representation into a vector of size C. The output vector from this classifier layer is

$g(x) = W \times d_x \qquad (8)$

The predicted class label can be calculated as

$\underset{i \in \{1, \ldots, C\}}{\arg\max} \left( g(x)_i \right).$

This predicted class belongs to one of the candidate classes {1, . . . , C}.

The last layer measures how much the predicted class of a document differs from its true class label:

$\begin{matrix}{{l\left( {x,y_{TRUE}} \right)} = {{- \log}\frac{\exp \left( {g(x)}_{y_{TRUE}}^{T} \right)}{\sum\limits_{i \in {\{{1\mspace{11mu} \ldots \mspace{11mu} C}\}}}{\exp \left( {g(x)}_{i}^{T} \right)}}}} & (9)\end{matrix}$

The whole network is trained by minimizing the loss function summed over a set of training examples, which includes a number of documents and their true class labels X={(x_(i),y_(i))_(i=1, . . . , L)|x_(i)εS, y_(i)εY}. The training procedure searches for the parameters (Table 1 below) that optimize the following total loss (commonly named the "negative log likelihood" loss),

${L(X)} = {\sum\limits_{{({x_{i},y_{i}})} \in X}{l\left( {x_{i},y_{i}} \right)}}$

Stochastic gradient descent (SGD) may be used to optimize the above loss. In SGD, for a training set X, instead of calculating a true gradient of the objective with all the training samples, the gradient is computed with a randomly chosen training sample (x, y_(TRUE))εX, with all parameters being updated to optimize equation 9. The SGD optimization method is scalable and has been proven to rival the performance of batch-mode gradient descent methods when dealing with large-scale datasets.

Backpropagation is used to optimize the loss function and learn the parameters (called "training") of each layered module of the network. Each step of the network (word embedding, n-gram embedding, spatial re-weighting, and classification) can be written more generally as a set of functions, l_(x)=f_(T)(f_(T−1)( . . . (f₁(x)) . . . )), where l_(x) denotes a loss on a single example x, and the final layer f_(T) is the loss function defined above in equation 9, evaluated using a single training example x. Each function has a set of parameters θ_(i) as described in Table 1. For example, θ₁={E}, θ₂={F}, and θ₄={a,b} or θ₄={B}. The overall system can be represented in a 6-layer network architecture (T=6):

Table 1

Level   Layer                                Parameters to Learn   Associated Equation
f₁      word embedding                       E                     Eq. (1)
f₂      n-gram embedding                     F                     Eq. (2)
f₃      transfer layer (tanh)                —                     —
f₄      spatial re-weighting of n-grams      {a, b} or B           Eq. (5) or Eq. (6)
        (document representation)
f₅      classifier                           W                     Eq. (8)
f₆      loss function                        —                     —

Each parameter listed in Table 1 corresponds to a parameter from one of the above equations. In particular, the parameter E is used in equation 1, the parameter F is used in equation 2, the parameters a and b are used in equation 5, the parameter B is used in equation 6, and the parameter W is used in equation 8.

For a layer f_(i), iε[1,T], the derivative

$\frac{\partial l}{\partial \theta_i}$

is used to update its parameter set θ_(i) via the delta rule

$\theta_i \leftarrow \theta_i - \lambda \cdot \frac{\partial l}{\partial \theta_i}$

where λ is a small constant called the learning rate, which influences the speed of learning the parameters. The delta rule is derived through gradient descent, which tries to optimize the parameters by minimizing the error (loss) in the output of each single-layer module. It can be seen that:

$\frac{\partial l}{\partial \theta_i} = \frac{\partial f_T}{\partial f_i} \times \frac{\partial f_i}{\partial \theta_i},$

and the first factor on the right can be recursively calculated:

$\frac{\partial f_T}{\partial f_i} = \frac{\partial f_T}{\partial f_{i+1}} \times \frac{\partial f_{i+1}}{\partial f_i},$

where

$\frac{\partial f_T}{\partial f_{i+1}} \text{ and } \frac{\partial f_i}{\partial \theta_i}$

are Jacobian matrices. Backpropagation is a generalization of the delta rule, which provides an efficient strategy to perform parameter learning and to optimize multi-layered network modules together.

Referring now to FIG. 4, a training procedure is shown that sets the weights described above. Block 402 initializes the parameters θ_(i) for the associated functions f_(i). Decision block 404 determines whether the parameters have converged. If the parameters have converged, then training ends. If not, block 406 randomly samples a data point and label in the text document. Block 408 calculates the loss l based on the randomly sampled data point and label.

Block 410 begins a loop by initializing an iterator index i to zero and an accumulator variable to one. Block 412 multiplies the accumulator by

$\frac{\partial f_{i}}{\partial\theta_{i}}$

and stores that value as

$\frac{\partial l}{\partial\theta_{i}}.$

Block 414 weights

$\frac{\partial l}{\partial\theta_{i}}$

by a factor λ and subtracts the weighted value from the parameter θ_(i), storing the updated parameter. Block 416 multiplies the accumulator by

$\frac{\partial f_{i + 1}}{\partial f_{i}}$

and stores the value as the new accumulator. Block 418 increments i, and decision block 420 determines whether to continue the loop based on whether all of the T layers have been updated. If not, processing returns to block 412. If so, processing returns to block 404 to determine whether the updated parameters have converged. The overall loop may be performed any number of times, whether until convergence or until a maximum number of iterations has been reached.
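For illustration, the following sketch assembles the layers of Table 1 into a trainable model; it uses PyTorch autograd and its built-in SGD in place of the hand-rolled accumulator loop of FIG. 4, and every size, initialization, and the bias terms of the linear layers are assumptions rather than details fixed by the present principles.

    import torch
    import torch.nn as nn
    import torch.nn.functional as func

    class NgramDocClassifier(nn.Module):
        # Layers f1-f5 of Table 1 with illustrative sizes; f6 (the loss)
        # is applied in the training loop below.
        def __init__(self, vocab=1000, m=50, n=3, M=100, K=4, C=2):
            super().__init__()
            self.n = n
            self.E = nn.Embedding(vocab, m)        # f1: word embedding, Eq. (1)
            self.F = nn.Linear(n * m, M)           # f2: n-gram embedding, Eq. (2)
            self.a = nn.Parameter(torch.randn(K))  # f4: mixture params, Eq. (5)
            self.b = nn.Parameter(torch.randn(K))
            self.W = nn.Linear(M, C)               # f5: classifier, Eq. (8)

        def forward(self, words):                  # words: 1-D tensor of indices
            grams = words.unfold(0, self.n, 1)     # sliding window of n-grams
            p = torch.tanh(self.F(self.E(grams).flatten(1)))  # f2 + f3
            j = torch.arange(1, len(p) + 1) / len(p)          # positions j/N
            raw = torch.sigmoid(j[:, None] * self.a + self.b).sum(1)
            q = raw / raw.sum()                    # Eq. (5), normalized by Q
            return self.W(q @ p)                   # Eq. (4), then Eq. (8)

    model = NgramDocClassifier()
    opt = torch.optim.SGD(model.parameters(), lr=0.01)  # delta-rule updates
    words = torch.randint(0, 1000, (30,))               # a 30-word toy document
    y_true = torch.tensor([1])
    for _ in range(5):                                  # blocks 404-420, simplified
        loss = func.cross_entropy(model(words)[None, :], y_true)  # f6, Eq. (9)
        opt.zero_grad()
        loss.backward()                                 # backpropagation
        opt.step()
    print(float(loss))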

Referring now to FIG. 6, a system for document classification 600 is shown. The system includes a processor 602 and memory 604, which store and process input text to produce classifications. The memory 604 furthermore stores training sets of text to be used by weight learning module 612. An n-gram embedding module 606 embeds sets of n input words from the input text to the latent space, while document embedding module 608 combines the output of the n-gram embedding module to form a document vector using weights for each n-gram, based on the position of those n-grams in the text. Classifier 610 uses the document vector, which characterizes the entire document, to classify the document as belonging to one or more classes. The weight learning module 612 initializes the weights used to embed the document and adjusts those weights based on learning training documents and on the output of the classifier 610.

Having described preferred embodiments of a system and method (which are intended to be illustrative and not limiting), it is noted that modifications and variations can be made by persons skilled in the art in light of the above teachings. It is therefore to be understood that changes may be made in the particular embodiments disclosed which are within the scope of the invention as outlined by the appended claims. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.

1. A method for document classification, comprising: embedding n-grams from an input text in a latent space; embedding the input text in the latent space based on the embedded n-grams and weighting the n-grams according to spatial evidence of the respective n-grams in the input text; classifying the document along one or more axes using a processor; and adjusting weights used to weight the n-grams based on the output of the classifying step.
2. The method of claim 1, wherein embedding the input text in the latent space is calculated as a weighted sum over the embedded n-grams in the input text.
3. The method of claim 1, wherein a weight of each n-gram is modeled as an output of a nonlinear function using a mixture model on a relative position of the n-gram in the input text.
4. The method of claim 1, wherein a weight of each n-gram is modeled as a function of a relative position of each n-gram in the input text and an embedding representation of each n-gram.
5. The method of claim 1, wherein embedding the input text in the latent space further comprises: dividing the input text into sub-parts; forming an embedded representation of each of the sub-parts based on embedded n-grams in each respective sub-part; and concatenating the sub-parts to form an embedded representation of the full input text.
6. The method of claim 5, wherein forming the embedded representation of each subpart includes calculating a weighted sum over the n-grams in the subpart.
7. The method of claim 1, wherein embedding the input text in the latent space comprises calculating a weighted sum over the n-grams in the input text.
8. The method of claim 1, wherein weights for each n-gram are learned by optimizing over a set of training documents with known class labels.
9. The method of claim 8, wherein the weights are learned using a stochastic gradient descent.
10. The method of claim 1, wherein classifying includes applying a classification having three or more classes.
11. A system for document classification, comprising: an n-gram embedding module configured to embed n-grams from an input text in a latent space; a document embedding module configured to embed the input text in the latent space based on the embedded n-grams, weighted according to spatial evidence of the respective n-grams in the input text; a classifier configured to classify the document along one or more axes using a processor; and a weight learning module configured to adjust the weights for the n-grams based on the output of the classifier.
12. The system of claim 11, wherein the document embedding module is further configured to embed the input text in the latent space as a weighted sum over the embedded n-grams in the input text.
13. The system of claim 11, wherein a weight of each n-gram is modeled as an output of a nonlinear function using the mixture model on the relative position of the n-gram in the input text.
14. The system of claim 11, wherein the weight of an n-gram is modeled as a function of both a relative position of the n-gram in the document and an embedding representation of the n-gram.
15. The system of claim 11, wherein the document embedding module is further configured to: divide the input text into sub-parts; form an embedded representation of each of the sub-parts based on embedded n-grams in each respective sub-part; and concatenate the sub-parts to form an embedded representation of the full input text.
16. The system of claim 15, wherein the document embedding module is further configured to form the embedded representation of each subpart by calculating a weighted sum over the n-grams in the subpart.
17. The system of claim 11, wherein the document embedding module is further configured to form the embedded representation of each subpart by calculating a weighted sum over the n-grams in the subpart.
18. The system of claim 11, wherein the weight learning module is further configured to learn weights for each n-gram by optimizing over a set of training documents with known class labels.
19. The system of claim 18, wherein the weight learning module is configured to learn the weights using a stochastic gradient descent.
20. The system of claim 11, wherein the classifier is configured to apply a classification having three or more classes.